Oxford TRECVID 2006 – Notebook Paper

Authors

  • James Philbin
  • Anna Bosch
  • Ondřej Chum
  • Jan-Mark Geusebroek
  • Josef Sivic
  • Andrew Zisserman
Abstract

The Oxford team participated in the high-level feature extraction and interactive search tasks. A vision-only approach was used for both tasks, with no use of the text or audio information. For the high-level feature extraction task, we used two different approaches, one using sparse and one using dense visual features, to learn classifiers for all 39 required concepts, using the training data supplied by MediaMill [29] for the 2005 data. In addition, we also used a face-specific classifier, with features computed for specific facial parts, to facilitate answering people-dependent queries such as “government leader”. We submitted 3 different runs for this task. OXVGG_A was the result of using the dense visual features only. OXVGG_OJ was the result of using the sparse visual features for all the concepts, except for “government leader”, “face” and “person”, where we prepended the results from the face classifier. OXVGG_AOJ was a run where we applied rank fusion to merge the outputs from the sparse and dense methods, with weightings tuned on the training data, and also prepended the face results for “face”, “person” and “government leader”. In general, the sparse features tended to perform best on the more object-based concepts, such as “US flag”, while the dense features performed slightly better on more scene-based concepts, such as “military”. Overall, the fused run did best with a Mean Average (inferred) Precision (MAP) of 0.093, the sparse run came second with a MAP of 0.080, followed by the dense run with a MAP of 0.053. For the interactive search task, we coupled the results generated during the high-level task with methods to facilitate efficient and productive interactive search. Our system allowed for several “expansion” methods based on the sparse and dense features, as well as a novel on-the-fly face classification system, which coupled a Google Images search with rapid Support Vector Machine (SVM) training and testing to return results containing a particular person within a few minutes. We submitted just one run, OXVGG_TVI, which performed well, winning two categories and coming above the median in 18 out of 24 queries.

1 High-level Feature Extraction

Our approach here is to train an SVM for the concept in question, then score all keyframes in the test set by the magnitude of their discriminant (the distance from the discriminating hyperplane), and subsequently rank the test shots by the scores of their keyframes. We have developed three methods for this task, each differing in their features and/or kernel. Two of the methods are applicable to general visual categories (such as airplane, mountain and road) and the third is specific to faces. The first two methods differ in that one uses sparse (based on region detectors) monochrome features, and the other uses dense (on a regular pixel grid) colour features. We now describe the three methods in some detail.

1.1 Bag of visual word representation

The first approach uses a bag of (visual) words [27] representation for the frames, where positional relationships between features are ignored. This representation has proved successful for classifying images according to whether they contain visual categories (such as cars, horses, etc.) by training an SVM [7]. Here we use the kernel formulation proposed by [31].
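To make the ranking procedure described at the start of this section concrete, the sketch below trains a kernel SVM on bag-of-visual-word histograms, scores every test keyframe by its signed distance from the decision hyperplane, and ranks shots by their keyframe scores. This is only a minimal illustration: the excerpt does not spell out the kernel of [31], so an exponential chi-squared kernel is assumed here, taking the best keyframe score per shot is likewise an assumption, and all function and variable names (train_concept_svm, rank_shots, train_hists, shot_ids) are hypothetical; scikit-learn is used purely for convenience.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_concept_svm(train_hists, labels, gamma=1.0, C=1.0):
    """Fit a kernel SVM for one concept on bag-of-visual-word histograms."""
    K_train = chi2_kernel(train_hists, train_hists, gamma=gamma)
    svm = SVC(kernel="precomputed", C=C)
    svm.fit(K_train, labels)                 # labels: +1 concept present, -1 absent
    return svm

def rank_shots(svm, train_hists, test_hists, shot_ids, gamma=1.0):
    """Score each test keyframe by its signed distance to the separating
    hyperplane and rank shots by the best score among their keyframes."""
    K_test = chi2_kernel(test_hists, train_hists, gamma=gamma)
    scores = svm.decision_function(K_test)
    best_per_shot = {}
    for shot, score in zip(shot_ids, scores):
        best_per_shot[shot] = max(best_per_shot.get(shot, -np.inf), score)
    return sorted(best_per_shot, key=best_per_shot.get, reverse=True)
```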
Features and bag of words representation: We used two types of affine region detectors, Hessian Laplace (HL) [21] and Maximally Stable Extremal Regions (MSER) [20]. Each region is then represented by a SIFT [19] descriptor using intensity only. This combination of detection and description generates features which are approximately invariant to an affine transformation of the image, see figure 1. These features are computed in all (representative and non-representative) keyframes. The ‘visual vocabulary’ is then constructed by vector quantizing the SIFT descriptors of a random subset of features from the training data using K-means. The K-means cluster centres define the visual words. Each feature type (HL and MSER) has its own vocabulary. We used two vocabulary sizes of 3,000 and 10,000 words for HL, and one vocabulary size of 3,000 words for MSER.
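As a concrete illustration of the vocabulary construction just described, the following sketch clusters a random subset of SIFT descriptors and quantizes each keyframe's descriptors into a visual-word histogram. MiniBatchKMeans stands in for the plain K-means named in the text (a practical choice for vocabularies of 3,000 to 10,000 words), L1 normalisation of the histogram is an added assumption, and the descriptor arrays (hl_sift_subset, mser_sift_subset) are hypothetical placeholders.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(sift_subset, n_words=3000, seed=0):
    """Vector-quantize a random subset of 128-D SIFT descriptors;
    the cluster centres become the visual words."""
    km = MiniBatchKMeans(n_clusters=n_words, random_state=seed, n_init=3)
    km.fit(sift_subset)                      # sift_subset: (n_descriptors, 128)
    return km

def bag_of_words(vocab, frame_descriptors):
    """Assign each descriptor of one keyframe to its nearest visual word
    and return the L1-normalised word histogram."""
    words = vocab.predict(frame_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# One vocabulary per detector type, with the sizes quoted in the text:
# vocab_hl_3k  = build_vocabulary(hl_sift_subset,   n_words=3000)
# vocab_hl_10k = build_vocabulary(hl_sift_subset,   n_words=10000)
# vocab_mser   = build_vocabulary(mser_sift_subset, n_words=3000)
```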
